The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak on my own curiosity. There are some interesting questions that may fall out of this, as it is a very historic moment, including scientifically and analytically (we have a large amount of data being shared across the globe, analyzed in real-time). The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by the source. I realized early on that depending on source there were conflicting # of cases. Originally I was using JHU data… but this was always ‘ahead’ of the Our World In Data. I noticed that JHU’s website was buggy- you clicked on the U.S. stats but it didn’t reflect the U.S.. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be why the discrepancy? Was I missing something?
* Defintiions are important as is the idea that multiple varibales accumulate in things like total cases (more testing for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/
*

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATIOn: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Thu Jun 18 14:41:00 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the ‘analysis’ but are still needed to prepare everything

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web scraping script (python , written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIORNMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Thu Jun 18 14:41:00 2020 ------##

# clear previous enviornment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes diretory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes diretory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name description
prediction_model outputs case estumate for given log-linear moder parameters slope and intercept
make_long converts input data to long format (specialized cases)
name_overlaps outputs the column names intersection and set diffs of two data frame
find_linear_index finds the first date at which linearaity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases<-m*days+b
  total_cases.log<-log(total_cases,10)
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts into long format, using date and total cases as variable/values. Also enforces standardization/assumes data struture naming by using fixed variable name, value name, id.vars, 
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
i<-intersect(names(df1),
names(df2))
sd1<-setdiff(names(df1),
names(df2))
sd2<-setdiff(names(df2),names(df1))
cat("intersection:\n",paste(i,"\n"))
cat("in df1 but not df2:\n",paste(sd1,"\n"))
cat("in df2 but not df1:\n",paste(sd2,"\n"))
return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  #Compute running sum of second derivative (window = 5). Choose point at which within 0.2
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-4),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  #Find date -5 from the stablility point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.17.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.17.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.17.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.17.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# for some reason, Population listed in US death file but not for other data... Weird. When combining, all datasets will have this column, but US deaths is the only useful one.  
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU has a gross file format. It's in wide format with each column is the date in MM/DD/YY. So read this in as raw data but trasnform it to be better suited for analysis
# Furthermore, the World and US level data is formatted differently, containing different columns, etc. Recitfy this and combine the world-level stats with U.S. level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # leading 0 read in as X
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # ends in .20 and not 2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 148 148
Albania 148 148
Algeria 148 148
Andorra 148 148
Angola 148 148
Antigua and Barbuda 148 148
Argentina 148 148
Armenia 148 148
Australia 1184 1184
Austria 148 148
Azerbaijan 148 148
Bahamas 148 148
Bahrain 148 148
Bangladesh 148 148
Barbados 148 148
Belarus 148 148
Belgium 148 148
Belize 148 148
Benin 148 148
Bhutan 148 148
Bolivia 148 148
Bosnia and Herzegovina 148 148
Botswana 148 148
Brazil 148 148
Brunei 148 148
Bulgaria 148 148
Burkina Faso 148 148
Burma 148 148
Burundi 148 148
Cabo Verde 148 148
Cambodia 148 148
Cameroon 148 148
Canada 2072 2072
Central African Republic 148 148
Chad 148 148
Chile 148 148
China 4884 4884
Colombia 148 148
Comoros 148 148
Congo (Brazzaville) 148 148
Congo (Kinshasa) 148 148
Costa Rica 148 148
Cote d’Ivoire 148 148
Croatia 148 148
Cuba 148 148
Cyprus 148 148
Czechia 148 148
Denmark 444 444
Diamond Princess 148 148
Djibouti 148 148
Dominica 148 148
Dominican Republic 148 148
Ecuador 148 148
Egypt 148 148
El Salvador 148 148
Equatorial Guinea 148 148
Eritrea 148 148
Estonia 148 148
Eswatini 148 148
Ethiopia 148 148
Fiji 148 148
Finland 148 148
France 1628 1628
Gabon 148 148
Gambia 148 148
Georgia 148 148
Germany 148 148
Ghana 148 148
Greece 148 148
Grenada 148 148
Guatemala 148 148
Guinea 148 148
Guinea-Bissau 148 148
Guyana 148 148
Haiti 148 148
Holy See 148 148
Honduras 148 148
Hungary 148 148
Iceland 148 148
India 148 148
Indonesia 148 148
Iran 148 148
Iraq 148 148
Ireland 148 148
Israel 148 148
Italy 148 148
Jamaica 148 148
Japan 148 148
Jordan 148 148
Kazakhstan 148 148
Kenya 148 148
Korea, South 148 148
Kosovo 148 148
Kuwait 148 148
Kyrgyzstan 148 148
Laos 148 148
Latvia 148 148
Lebanon 148 148
Lesotho 148 148
Liberia 148 148
Libya 148 148
Liechtenstein 148 148
Lithuania 148 148
Luxembourg 148 148
Madagascar 148 148
Malawi 148 148
Malaysia 148 148
Maldives 148 148
Mali 148 148
Malta 148 148
Mauritania 148 148
Mauritius 148 148
Mexico 148 148
Moldova 148 148
Monaco 148 148
Mongolia 148 148
Montenegro 148 148
Morocco 148 148
Mozambique 148 148
MS Zaandam 148 148
Namibia 148 148
Nepal 148 148
Netherlands 740 740
New Zealand 148 148
Nicaragua 148 148
Niger 148 148
Nigeria 148 148
North Macedonia 148 148
Norway 148 148
Oman 148 148
Pakistan 148 148
Panama 148 148
Papua New Guinea 148 148
Paraguay 148 148
Peru 148 148
Philippines 148 148
Poland 148 148
Portugal 148 148
Qatar 148 148
Romania 148 148
Russia 148 148
Rwanda 148 148
Saint Kitts and Nevis 148 148
Saint Lucia 148 148
Saint Vincent and the Grenadines 148 148
San Marino 148 148
Sao Tome and Principe 148 148
Saudi Arabia 148 148
Senegal 148 148
Serbia 148 148
Seychelles 148 148
Sierra Leone 148 148
Singapore 148 148
Slovakia 148 148
Slovenia 148 148
Somalia 148 148
South Africa 148 148
South Sudan 148 148
Spain 148 148
Sri Lanka 148 148
Sudan 148 148
Suriname 148 148
Sweden 148 148
Switzerland 148 148
Syria 148 148
Taiwan* 148 148
Tajikistan 148 148
Tanzania 148 148
Thailand 148 148
Timor-Leste 148 148
Togo 148 148
Trinidad and Tobago 148 148
Tunisia 148 148
Turkey 148 148
Uganda 148 148
Ukraine 148 148
United Arab Emirates 148 148
United Kingdom 1628 1628
Uruguay 148 148
US 148 148
US_state 482628 482628
Uzbekistan 148 148
Venezuela 148 148
Vietnam 148 148
West Bank and Gaza 148 148
Western Sahara 148 148
Yemen 148 148
Zambia 148 148
Zimbabwe 148 148
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

#Compute moratlity rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find day that 100th case was found for Country/Province. NOTE: Non US countries may have weird provinces. For example, Frane is summairzed at the country level but also had 3 providences. I've only ensured the U.S. case100 works... so the case100_date for U.S. is summarized both for the entire country (regardless of state) and on a per-state level. 
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 5757
Alaska 1162
Arizona 1406
Arkansas 6105
California 5358
Colorado 5207
Connecticut 835
Delaware 351
Diamond Princess 93
District of Columbia 94
Florida 6062
Georgia 13713
Grand Princess 94
Guam 94
Hawaii 482
Idaho 2794
Illinois 8072
Indiana 7941
Iowa 7580
Kansas 6422
Kentucky 9106
Louisiana 5710
Maine 1457
Maryland 2201
Massachusetts 1394
Michigan 7078
Minnesota 6739
Mississippi 7087
Missouri 8152
Montana 2502
Nebraska 4951
Nevada 1131
New Hampshire 994
New Jersey 2103
New Mexico 2484
New York 5367
North Carolina 8436
North Dakota 3065
Northern Mariana Islands 79
Ohio 7501
Oklahoma 5823
Oregon 2889
Pennsylvania 5874
Puerto Rico 94
Rhode Island 556
South Carolina 4112
South Dakota 3885
Tennessee 8144
Texas 17710
Utah 1279
Vermont 1335
Virgin Islands 94
Virginia 10714
Washington 3658
West Virginia 4022
Wisconsin 5786
Wyoming 1803
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases, mortality across geopgraphical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question dataset x y color facet pch dimentions
comparative_longitudinal_case_trend long time log_cases geography none (case type?) case_type [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend long time cases geography case_type ? [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend wide time mortality rate geography none none [15, 50, 4] geography
death vs total correlation wide cases deaths geography none none [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in case/deaths over time, comapred across regions <time> <log cases> <geography*> <none> <.wide>
## trend in case/deaths over time, comapred across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, comapred across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? - > apply lm from each region starting from 100th case. m, b associated with each.
    # input: geographical regsion, logcase vs day (100th case)
    # output: m, b for each geographical region ID



#total/death on same plot-  diffeer by 2 logs, so when plotting log, use pch. when plotting absolute, need to use free scales
#when plotting death and case on same, melt. 

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficent data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 63826
China 148
Diamond Princess 129
Korea, South 119
Japan 118
Italy 116
Iran 113
Singapore 110
France 109
Germany 109
Spain 108
US 106
Switzerland 105
United Kingdom 105
Belgium 104
Netherlands 104
Norway 104
Sweden 104
Austria 102
Malaysia 101
Australia 100
Bahrain 100
Denmark 100
Canada 99
Qatar 99
Iceland 98
Brazil 97
Czechia 97
Finland 97
Greece 97
Iraq 97
Israel 97
Portugal 97
Slovenia 97
Egypt 96
Estonia 96
India 96
Ireland 96
Kuwait 96
Philippines 96
Poland 96
Romania 96
Saudi Arabia 96
Indonesia 95
Lebanon 95
Pakistan 95
San Marino 95
Thailand 95
Chile 94
Luxembourg 93
Peru 93
Russia 93
Ecuador 92
Mexico 92
Slovakia 92
South Africa 92
United Arab Emirates 92
Armenia 91
Colombia 91
Croatia 91
Panama 91
Serbia 91
Taiwan* 91
Turkey 91
Argentina 90
Bulgaria 90
Latvia 90
Uruguay 90
Algeria 89
Costa Rica 89
Dominican Republic 89
Hungary 89
Andorra 88
Bosnia and Herzegovina 88
Jordan 88
Lithuania 88
Morocco 88
New Zealand 88
North Macedonia 88
Vietnam 88
Albania 87
Cyprus 87
Malta 87
Moldova 87
Brunei 86
Burkina Faso 86
Sri Lanka 86
Tunisia 86
Ukraine 85
Azerbaijan 84
Ghana 84
Kazakhstan 84
Oman 84
Senegal 84
Venezuela 84
Afghanistan 83
Cote d’Ivoire 83
Cuba 82
Mauritius 82
Uzbekistan 82
Cambodia 81
Cameroon 81
Honduras 81
Nigeria 81
West Bank and Gaza 81
Belarus 80
Georgia 80
Bolivia 79
Kosovo 79
Kyrgyzstan 79
Montenegro 79
Congo (Kinshasa) 78
Kenya 77
Niger 76
Guinea 75
Rwanda 75
Trinidad and Tobago 75
Paraguay 74
Bangladesh 73
Djibouti 71
El Salvador 70
Guatemala 69
Madagascar 68
Mali 67
Congo (Brazzaville) 64
Jamaica 64
Gabon 62
Somalia 62
Tanzania 62
Ethiopia 61
Burma 60
Sudan 59
Liberia 58
Maldives 56
Equatorial Guinea 55
Cabo Verde 53
Sierra Leone 51
Guinea-Bissau 50
Togo 50
Zambia 49
Eswatini 48
Chad 47
Tajikistan 46
Haiti 44
Sao Tome and Principe 44
Benin 42
Nepal 42
Uganda 42
Central African Republic 41
South Sudan 41
Guyana 39
Mozambique 38
Yemen 34
Mongolia 33
Mauritania 30
Nicaragua 30
Malawi 24
Syria 24
Zimbabwe 22
Bahamas 21
Libya 21
Comoros 19
Suriname 11
Angola 8
Eritrea 3
Burundi 2
# Pick top 15 countries with data
max_colors<-12
# find way to fix this- China has diff provences. Plot doesnt look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 148
Korea, South 119
Japan 118
Italy 116
Iran 113
Singapore 110
France 109
Germany 109
Spain 108
US 106
Switzerland 105
United Kingdom 105
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-04-17        18369
## 2    Afghanistan           <NA> <NA> 2020-04-16        18368
## 3    Afghanistan           <NA> <NA> 2020-04-15        18367
## 4    Afghanistan           <NA> <NA> 2020-06-02        18415
## 5    Afghanistan           <NA> <NA> 2020-04-26        18378
## 6    Afghanistan           <NA> <NA> 2020-04-19        18371
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                     30                   906     0.03311258
## 2                     30                   840     0.03571429
## 3                     25                   784     0.03188776
## 4                    270                 16509     0.01635472
## 5                     50                  1531     0.03265839
## 6                     33                   996     0.03313253
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  2.957128                   1.477121        18348
## 2                  2.924279                   1.477121        18348
## 3                  2.894316                   1.397940        18348
## 4                  4.217721                   2.431364        18348
## 5                  3.184975                   1.698970        18348
## 6                  2.998259                   1.518514        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1             21  NA   NA         NA                           NA
## 2             20  NA   NA         NA                           NA
## 3             19  NA   NA         NA                           NA
## 4             67  NA   NA         NA                           NA
## 5             30  NA   NA         NA                           NA
## 6             23  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write polots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))
+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b what is the model

Fit the cases to a linear model 1. Find time at which the case vs date becomes linear in each plot
2. Fit linear model for each city

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually fix states so that Maryland is switched out for New York
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes lienar (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

predicted_days_df<-data.frame(matrix(ncol=3))
names(predicted_days_df)<-c("Province.State","days","Total_confirmed_cases_perstate")

# USe model parameters to estiamte case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the prediction of COVID-19 based on model thus far? Additional questions:

WHy did it take to day 40 to start a log linear trend? How long will it be till x number of cases? When will the plateu happen? Are any effects noticed with social distancing? Delays

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

histoical_model<-data.frame(date=today_num,m=slope,b=intercept)
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
ddply(tmp,c("state"))
lm(data =tmp,formula = y~x )

train_lm<-function(input_data,subset_coulmn,formula_input){
case_models <- dlply(input_data, subset_coulmn, lm, formula = formula_input)
case_models.parameters <- ldply(case_models, coef)
case_models.parameters<-rename(case_models.parameters,c("b"="(Intercept)","m"=subset_coulmn))
return(case_models.parameters)
}

train_lm(tmp,"state")

 dlply(input_data, subset_coulmn, lm,m=)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect on social distancing, descreased mobility on case load?

Load data from Google which compoutes % change in user mobility relative to baseline for * Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upsteam, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people(Today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Grocery_Pharmacy 0.9947636 -34.0
Hawaii Retail_Recreation 0.9729533 -56.0
New Hampshire Parks 0.9510318 -20.0
Vermont Parks 0.9354611 -35.5
Maine Transit -0.9149105 -50.0
Hawaii Parks 0.8909425 -72.0
Connecticut Grocery_Pharmacy -0.8892043 -6.0
Utah Residential -0.8763111 12.0
Hawaii Transit 0.8505994 -89.0
Utah Transit -0.8459559 -18.0
South Dakota Parks 0.8024174 -26.0
Arizona Grocery_Pharmacy -0.7764535 -15.0
Wyoming Parks -0.7722270 -4.0
Rhode Island Workplace -0.7677330 -39.5
Connecticut Transit -0.7652498 -50.0
Massachusetts Workplace -0.7520500 -39.0
Alaska Grocery_Pharmacy -0.7441638 -7.0
Alaska Workplace -0.7158284 -33.0
Arizona Residential 0.7144825 13.0
Alaska Residential 0.6709091 13.0
Nevada Transit -0.6555946 -20.0
New York Workplace -0.6529864 -34.5
Vermont Grocery_Pharmacy -0.6527134 -25.0
North Dakota Parks 0.6435521 -34.0
Rhode Island Retail_Recreation -0.6311475 -45.0
Utah Parks -0.6164639 17.0
New Jersey Parks -0.6032447 -6.0
Rhode Island Residential -0.5983355 18.5
Idaho Residential -0.5981256 11.0
Maine Workplace -0.5918860 -30.0
New York Retail_Recreation -0.5899778 -46.0
Nebraska Workplace 0.5818339 -32.0
Utah Workplace -0.5549784 -37.0
Arizona Retail_Recreation -0.5397469 -42.5
New Jersey Workplace -0.5329491 -44.0
New York Parks 0.5293822 20.0
Connecticut Retail_Recreation -0.5285466 -45.0
Hawaii Residential -0.5258143 19.0
Connecticut Residential 0.5222568 14.0
Massachusetts Retail_Recreation -0.4963927 -44.0
Arkansas Parks 0.4919159 -12.0
Maine Parks 0.4904383 -31.0
Connecticut Workplace -0.4866511 -39.0
New Jersey Grocery_Pharmacy -0.4865297 2.5
New Mexico Grocery_Pharmacy -0.4833580 -11.0
West Virginia Parks 0.4687649 -33.0
Nebraska Residential -0.4658619 14.0
Arizona Transit 0.4656115 -38.0
Iowa Parks -0.4627565 28.5
Montana Parks -0.4624845 -58.0
New Mexico Residential 0.4560046 13.5
Maryland Workplace -0.4547760 -35.0
Rhode Island Parks 0.4481230 52.0
Kansas Parks 0.4404620 72.0
North Dakota Retail_Recreation -0.4392048 -42.0
Illinois Transit -0.4282323 -31.0
New Jersey Retail_Recreation -0.4232441 -62.5
Pennsylvania Workplace -0.4197047 -36.0
Montana Residential 0.4142652 14.0
New Mexico Parks 0.4108262 -31.5
Massachusetts Grocery_Pharmacy -0.4080797 -7.0
South Carolina Workplace 0.4074137 -30.0
Maryland Grocery_Pharmacy -0.4072276 -10.0
New Jersey Transit -0.4068095 -50.5
Vermont Residential 0.4056732 11.5
New Hampshire Residential -0.3957885 14.0
Kentucky Parks -0.3956341 28.5
Pennsylvania Retail_Recreation -0.3931098 -45.0
Michigan Parks 0.3872308 28.5
Alabama Workplace -0.3841913 -29.0
New Mexico Retail_Recreation -0.3729156 -42.5
Oregon Parks -0.3683446 16.5
New York Transit -0.3672879 -48.0
Alabama Grocery_Pharmacy -0.3672738 -2.0
Missouri Residential -0.3540422 13.0
Nebraska Grocery_Pharmacy 0.3419487 -0.5
Wisconsin Transit -0.3384426 -23.5
North Dakota Workplace 0.3322395 -40.0
Arkansas Retail_Recreation -0.3269512 -30.0
Maryland Retail_Recreation -0.3211834 -39.0
Idaho Workplace -0.3191705 -29.0
Virginia Transit -0.3175134 -33.0
Idaho Grocery_Pharmacy -0.3168106 -5.5
California Residential 0.3084436 14.0
Montana Transit 0.3074741 -41.0
California Transit -0.3045167 -42.0
Maine Retail_Recreation -0.3038669 -42.0
Nevada Residential 0.3027008 17.0
California Parks -0.2991494 -38.5
Illinois Workplace -0.2933922 -30.5
Colorado Residential 0.2929478 14.0
Wyoming Grocery_Pharmacy -0.2926152 -10.0
South Carolina Parks -0.2911161 -23.0
Florida Residential 0.2901461 14.0
Alaska Retail_Recreation 0.2875349 -39.0
Minnesota Transit -0.2865186 -28.5
Idaho Transit -0.2845171 -30.0
Wyoming Workplace -0.2830794 -31.0
Pennsylvania Parks 0.2792139 12.0
Vermont Retail_Recreation 0.2782373 -57.0
Oregon Grocery_Pharmacy -0.2775639 -7.0
Wyoming Transit -0.2689419 -17.0
North Carolina Grocery_Pharmacy 0.2685473 0.0
North Carolina Workplace 0.2655352 -31.0
Pennsylvania Grocery_Pharmacy -0.2654345 -6.0
Georgia Grocery_Pharmacy -0.2633300 -10.0
Kansas Workplace 0.2609616 -32.5
Vermont Workplace -0.2601713 -43.0
Rhode Island Grocery_Pharmacy 0.2579082 -7.5
Maryland Residential 0.2559636 15.0
Alabama Transit -0.2548113 -36.5
Mississippi Residential 0.2511330 13.0
Illinois Parks 0.2500433 26.5
Rhode Island Transit -0.2479675 -56.0
Tennessee Workplace -0.2478116 -31.0
Texas Workplace 0.2477673 -32.0
West Virginia Grocery_Pharmacy -0.2414539 -6.0
Florida Parks -0.2400281 -43.0
Tennessee Residential 0.2387085 11.5
Washington Grocery_Pharmacy 0.2362768 -7.0
Wisconsin Parks 0.2322063 51.5
Georgia Workplace -0.2302457 -33.5
New York Grocery_Pharmacy -0.2289018 8.0
Texas Residential -0.2248526 15.0
Minnesota Parks 0.2236673 -9.0
Missouri Workplace 0.2225327 -29.0
Georgia Retail_Recreation -0.2185902 -41.0
Hawaii Workplace 0.2184238 -46.0
Connecticut Parks 0.2152508 43.0
South Dakota Workplace 0.2147216 -35.0
Nevada Retail_Recreation -0.2103494 -43.0
Arizona Parks -0.2090472 -44.5
North Carolina Transit 0.2083002 -32.0
West Virginia Workplace 0.2056735 -33.0
Oregon Residential -0.2048459 10.5
North Dakota Grocery_Pharmacy -0.2020507 -8.0
Indiana Parks -0.2008507 29.0
Nevada Workplace -0.1999439 -40.0
Kansas Grocery_Pharmacy -0.1999298 -14.0
Nebraska Transit -0.1949971 -9.0
Mississippi Grocery_Pharmacy -0.1934143 -8.0
Illinois Residential 0.1916707 14.0
Colorado Parks -0.1916413 2.0
Tennessee Parks -0.1907621 10.5
Kentucky Transit -0.1823871 -31.0
Virginia Residential 0.1815482 14.0
Alabama Parks 0.1806021 -1.0
North Carolina Residential 0.1801741 13.0
Wisconsin Residential -0.1784243 14.0
Texas Parks 0.1734117 -42.0
Kentucky Grocery_Pharmacy 0.1724719 4.0
Utah Retail_Recreation -0.1721727 -40.0
Pennsylvania Transit -0.1707970 -42.0
Idaho Retail_Recreation -0.1689539 -39.5
New Hampshire Retail_Recreation -0.1680346 -41.0
Indiana Residential 0.1647059 12.0
Ohio Transit 0.1614432 -28.0
Massachusetts Parks 0.1612323 39.0
South Dakota Residential 0.1586762 15.0
New Jersey Residential 0.1561462 18.0
New Mexico Transit 0.1501835 -38.5
Iowa Transit 0.1472111 -24.0
Minnesota Retail_Recreation 0.1386392 -40.5
Michigan Workplace -0.1385343 -40.0
North Dakota Transit 0.1383314 -48.0
North Dakota Residential -0.1377901 17.0
Texas Grocery_Pharmacy 0.1367169 -14.0
Virginia Grocery_Pharmacy -0.1356063 -8.0
Missouri Parks 0.1341780 0.0
Indiana Retail_Recreation 0.1308234 -38.0
California Grocery_Pharmacy -0.1304283 -11.5
Missouri Transit -0.1252887 -24.5
Mississippi Retail_Recreation -0.1250649 -40.0
Arkansas Residential 0.1228100 12.0
Wisconsin Grocery_Pharmacy 0.1200599 -1.0
North Carolina Retail_Recreation 0.1195394 -34.0
Oklahoma Parks 0.1192417 -18.5
West Virginia Residential -0.1190854 11.0
Idaho Parks 0.1188234 -22.0
Kentucky Residential 0.1166247 12.0
Wisconsin Workplace -0.1159645 -31.0
Florida Retail_Recreation 0.1134600 -43.0
California Retail_Recreation -0.1115360 -44.0
Utah Grocery_Pharmacy 0.1101583 -4.0
Montana Retail_Recreation 0.1082546 -50.0
Montana Workplace -0.1077526 -40.0
Wyoming Residential 0.1076194 12.5
Michigan Residential 0.1070626 15.0
Maryland Transit -0.1066350 -39.0
Nebraska Retail_Recreation 0.1064528 -36.0
Mississippi Parks -0.1055566 -25.0
Arkansas Workplace -0.1049399 -26.0
Oklahoma Grocery_Pharmacy -0.1043888 -0.5
Kansas Transit -0.1035855 -26.5
Massachusetts Transit -0.1033202 -45.0
Oregon Retail_Recreation 0.1027857 -40.5
Iowa Workplace 0.1018614 -30.0
New Hampshire Grocery_Pharmacy -0.1008691 -6.0
Massachusetts Residential 0.1003879 15.0
Virginia Parks 0.0999071 6.0
Alabama Retail_Recreation 0.0991263 -39.0
Mississippi Transit -0.0963377 -38.5
Michigan Transit 0.0956990 -46.0
Indiana Workplace 0.0941518 -34.0
Minnesota Grocery_Pharmacy 0.0937549 -6.0
Oregon Workplace -0.0933846 -31.0
Michigan Retail_Recreation -0.0918149 -53.0
New York Residential 0.0907508 17.5
Ohio Residential 0.0901676 14.0
Iowa Grocery_Pharmacy -0.0883928 4.0
Oklahoma Workplace 0.0878819 -31.0
West Virginia Transit -0.0873222 -45.0
Alabama Residential -0.0858002 11.0
Texas Transit 0.0829276 -41.0
Georgia Residential -0.0823809 13.0
Virginia Retail_Recreation -0.0796062 -35.0
Nevada Parks 0.0786215 -12.5
Pennsylvania Residential 0.0777935 15.0
South Dakota Transit -0.0770130 -40.0
Virginia Workplace -0.0767439 -32.0
Minnesota Residential -0.0757907 17.0
Oregon Transit 0.0756402 -27.5
Kentucky Retail_Recreation 0.0751733 -29.0
Florida Transit -0.0748017 -49.0
Ohio Parks -0.0742580 67.5
Georgia Parks 0.0739677 -6.0
Ohio Grocery_Pharmacy 0.0718866 0.0
Texas Retail_Recreation 0.0623424 -40.0
Maine Residential -0.0616120 11.0
Vermont Transit -0.0583184 -63.0
Minnesota Workplace -0.0568999 -33.0
Washington Transit -0.0564257 -33.5
North Carolina Parks -0.0559940 7.0
Florida Grocery_Pharmacy 0.0547354 -14.0
Colorado Transit 0.0496487 -36.0
Ohio Retail_Recreation 0.0487401 -36.0
Georgia Transit -0.0472682 -35.0
South Dakota Retail_Recreation -0.0471844 -39.0
New Hampshire Workplace 0.0469066 -37.0
California Workplace -0.0457712 -36.0
Washington Retail_Recreation 0.0456369 -42.0
Kentucky Workplace -0.0438541 -36.5
Indiana Transit 0.0426119 -29.0
Iowa Retail_Recreation 0.0394486 -38.0
Ohio Workplace -0.0374366 -35.0
Illinois Grocery_Pharmacy -0.0353067 2.0
Arkansas Grocery_Pharmacy -0.0349518 3.0
Washington Parks 0.0341708 -3.5
Missouri Retail_Recreation -0.0340360 -36.0
Tennessee Transit 0.0333686 -32.0
Arizona Workplace -0.0324137 -35.0
Nebraska Parks 0.0313258 55.5
New Hampshire Transit 0.0312727 -57.0
Colorado Grocery_Pharmacy -0.0307375 -17.0
New Mexico Workplace 0.0303356 -34.0
Washington Workplace -0.0297964 -38.0
Colorado Retail_Recreation -0.0281544 -44.0
Wisconsin Retail_Recreation 0.0276291 -44.0
West Virginia Retail_Recreation -0.0262738 -38.5
South Carolina Grocery_Pharmacy 0.0261216 1.0
Washington Residential 0.0252693 13.0
Maine Grocery_Pharmacy -0.0240360 -13.0
Michigan Grocery_Pharmacy -0.0226211 -11.0
Tennessee Retail_Recreation -0.0201555 -30.0
Illinois Retail_Recreation 0.0194795 -40.0
Indiana Grocery_Pharmacy -0.0181920 -5.5
South Carolina Residential -0.0179439 12.0
South Carolina Transit -0.0173707 -45.0
Maryland Parks 0.0171922 27.0
Tennessee Grocery_Pharmacy 0.0159403 6.0
Wyoming Retail_Recreation 0.0158178 -39.0
Mississippi Workplace -0.0152403 -33.0
Kansas Residential -0.0150239 13.0
South Dakota Grocery_Pharmacy -0.0138903 -9.0
Colorado Workplace 0.0125797 -39.0
Oklahoma Transit 0.0115592 -26.0
Iowa Residential -0.0101969 13.0
Oklahoma Residential 0.0098935 15.0
South Carolina Retail_Recreation 0.0078966 -35.0
Montana Grocery_Pharmacy -0.0073289 -16.0
Florida Workplace 0.0063612 -33.0
Kansas Retail_Recreation -0.0060651 -38.0
Missouri Grocery_Pharmacy 0.0049274 2.0
Arkansas Transit -0.0034922 -27.0
Oklahoma Retail_Recreation -0.0030671 -31.0
Nevada Grocery_Pharmacy -0.0015128 -12.5
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-Sate Median Change in Mobility")+
  xlab("Mobility Meaure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following link to commited documents pushed to github. These are provided as a convienence, but note this is a manual process. The generation of reports, plots and tables is not coupled to the execution of this markdown. ## Report This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
name handle link
World total & death cases, longitudinal Corona_Cases.world.long.plot https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
World log total & death cases, longitudinal Corona_Cases.world.loglong.plot https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
World mortality Corona_Cases.world.mortality.plot https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
World total & death cases, correlation Corona_Cases.world.casecor.plot https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
City total & death cases, longitudinal Corona_Cases.city.long.plot https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
City log total & death cases, longitudinal Corona_Cases.city.loglong.plot https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
City mortality Corona_Cases.city.mortality.plot https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
City total & death cases, correlation Corona_Cases.city.casecor.plot https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
City population normalized total & death cases, longitudinal Corona_Cases.city.long.normalized.plot https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
State total cases (select) with linear model, longitudinal Corona_Cases.US_state.lm.plot https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
State total cases, longitudinal Corona_Cases.US_state.summary.plot https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png

Tables

CONCLUSION

Overall, the trends of COVID-19 cases is no longer in log-linear phase for world or U.S. (but some regions like MD are still in the log-linear phase). Mortality rate (deaths/confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

Question Answer
What is the effect on social distancing, descreased mobility on case load?
There is not a strong apparent effect on decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on number of confirmed cases, either as a country (U.S.) or state level. California appears to have one of the best correlations, but this is a mixed bag
What is the trend in cases, mortality across geopgraphical regions?
The confirmed total casees and mortality is overall log-linear for most countries, with a trailing off beginning for most (inlcuding U.S.). On the state level, NY, NJ, PA starting to trail off; MD is still in log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, MD. The mortality rate flucutates for a given region, but is about 3% overall.

END

End: ##—— Thu Jun 18 14:42:24 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com>

Sandbox

# Geographical heatmap!
install.packages("maps")
library(maps)
library
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases<-1:2226
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

some.eu.countries <- c(
  "US"
)
# Retrievethe map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and lattitude
# Used as label coordinate for country's names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state,Date=="2020-04-17") aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

join::inner_join.sf(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_cast(spdf)

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)
id

https://stevenbsmith.net